Introduction

The dataset used for this project is the white wine quality dataset. It contains information on 4,898 white wines from a 2009 study, with 11 chemical variables plus a quality rating for each wine. The aim of my investigation is to see whether any variables affect volatile acidity, which in turn affects the quality of the white wine.

Data Summary

The dataset description is shown below. I created a new variable called bound sulfur dioxide, which is total sulfur dioxide minus free sulfur dioxide.
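
As a minimal sketch of how the derived variable can be computed (the original code is not shown in this report, so the data frame name `wine` is taken from the output in later sections, and the ten sample rows are copied from the structure output below rather than loaded from a file):

```r
# First ten rows of the two sulfur dioxide columns, copied from the
# structure output in this report. The full analysis would use the
# complete 4898-row data frame instead.
wine <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Bound sulfur dioxide = total sulfur dioxide - free sulfur dioxide
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

print(wine$bound.sulfur.dioxide)
```

The printed values match the first ten entries of `bound.sulfur.dioxide` in the structure output (125, 118, 67, ...).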

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ bound.sulfur.dioxide: num  125 118 67 139 139 67 106 125 118 101 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality      bound.sulfur.dioxide
##  Min.   :3.000   Min.   :  4.0       
##  1st Qu.:5.000   1st Qu.: 78.0       
##  Median :6.000   Median :100.0       
##  Mean   :5.878   Mean   :103.1       
##  3rd Qu.:6.000   3rd Qu.:125.0       
##  Max.   :9.000   Max.   :331.0

Next, I perform univariate, bivariate and multivariate analysis.

Univariate Analysis

Volatile Acidity

The distribution appears unimodal with the volatile acidity peaking around 0.28.


Quality

Is there any effect on the quality? What does this plot look like across the quality levels?

The majority of white wines have a quality level of 5 or 6.


pH Level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The frequency table below is for volatile acidity (values from 0.08 to 1.1), not pH:

## 
##  0.08 0.085  0.09   0.1 0.105  0.11 0.115  0.12 0.125  0.13 0.135  0.14 
##     4     1     1     6     6    13     3    34     3    44     1    56 
## 0.145  0.15 0.155  0.16 0.165  0.17 0.175  0.18 0.185  0.19   0.2 0.205 
##     4    88     5   141     2   140     1   177     5   170   214     4 
##  0.21 0.215  0.22 0.225  0.23 0.235  0.24 0.245  0.25 0.255  0.26 0.265 
##   191     1   229     4   216     4   253     4   231    10   240     5 
##  0.27 0.275  0.28 0.285  0.29 0.295   0.3 0.305  0.31 0.315  0.32 0.325 
##   218     3   263     5   160     3   198     4   148     4   182     2 
##  0.33 0.335  0.34 0.345  0.35 0.355  0.36 0.365  0.37 0.375  0.38 0.385 
##   134     7   135     9    86     1   104     2    65     2    63     2 
##  0.39 0.395   0.4 0.405  0.41 0.415  0.42 0.425  0.43 0.435  0.44 0.445 
##    61     2    59     1    54     4    36     2    35     2    46     4 
##  0.45 0.455  0.46  0.47 0.475  0.48 0.485  0.49 0.495   0.5  0.51  0.52 
##    25     2    30    15     3    17     3    14     2    14    10    10 
##  0.53  0.54 0.545  0.55 0.555  0.56  0.57  0.58 0.585  0.59 0.595   0.6 
##     8    10     1    14     2     9     4     7     2     4     2     7 
##  0.61 0.615  0.62  0.63  0.64  0.65 0.655  0.66  0.67  0.68 0.685  0.69 
##     7     4     5     2     7     2     3     4     5     3     1     2 
## 0.695 0.705  0.71  0.73  0.74  0.75  0.76  0.78 0.785 0.815  0.85 0.905 
##     3     2     1     1     1     1     2     1     1     1     1     1 
##  0.91  0.93 0.965 1.005   1.1 
##     1     1     1     1     1

The pH distribution has a peak around 3.14. The pH level is probably affected by acidity. The minimum pH is 2.720 and the maximum is 3.820.


Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density has a very small range, from 0.9871 to 1.0390.


Alcohol percentage by volume

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

There is a peak around 9.4, and the distribution is skewed to the right.


Sulphates

## 
## 0.22 0.23 0.25 0.26 0.27 0.28 0.29  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 
##    1    1    4    4   13   13   16   31   35   54   59   84   85  120  129 
## 0.38 0.39  0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##  214  151  168  139  181  161  216  178  225  172  179  166  249  140  156 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##  135  167  102  108   83   99   97   88   45   68   48   67   28   36   35 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   44   30   27   18   33   12   19   22   19   16   19   16    5    5   13 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.92 0.94 0.95 0.96 0.97 0.98 0.99 
##    2    4    3    2    2    7    1    5    2    2    5    3    1    6    1 
##    1 1.01 1.06 1.08 
##    1    1    1    1

The distribution is skewed to the right. It also appears slightly bi-modal: in the frequency table above, the sulphate counts peak around 0.38 and again around 0.5.


Citric Acid

This is a square root transformed histogram plot of citric acid. There is a peak in citric acid concentration around 0.30, and a sudden spike at 0.48. Apart from that spike, the distribution is approximately normal.


Bound Sulfur Dioxide

This is a square root transformed histogram plot of bound sulfur dioxide. There is a peak in bound sulfur dioxide concentration around 85. Distribution is skewed to the right.


Free Sulfur Dioxide

This is a square root transformed histogram plot of free sulfur dioxide. There is a peak in free sulfur dioxide concentration around 30. The distribution is skewed to the right.
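
The plotting code behind the square root transformed histograms is not shown in this report, so the following is only one possible reading of the technique: take the square root of the values first, then bin them. The sample values are the first ten `bound.sulfur.dioxide` entries from the structure output; the variable names `sqrt_hist` and `edges_original_units` are my own.

```r
# First ten bound sulfur dioxide values, copied from the structure output above.
bound.sulfur.dioxide <- c(125, 118, 67, 139, 139, 67, 106, 125, 118, 101)

# Transform first, then bin. plot = FALSE returns the bins without drawing;
# drop it to see the histogram itself.
sqrt_hist <- hist(sqrt(bound.sulfur.dioxide), plot = FALSE)

# The bin edges are on the square-root scale; squaring them maps the
# edges back to the original units for easier reading.
edges_original_units <- sqrt_hist$breaks^2

print(data.frame(
  from  = head(edges_original_units, -1),
  to    = tail(edges_original_units, -1),
  count = sqrt_hist$counts
))
```

The transformation compresses the long right tail, which makes the bulk of a right-skewed distribution easier to see. If the plots were made with ggplot2, `scale_x_sqrt()` would be a likely alternative, but that is an assumption on my part.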


Q & A

What is the structure of your dataset?

The data frame consists of 4,898 white wines with 12 original variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality) plus 1 derived variable (bound sulfur dioxide), matching the 13 columns in the structure output above. Quality is an ordinal variable, stored as an integer, graded on the following scale.

Quality: (Worst) 0, 1, ..., 9, 10 (Best)

The quality scores observed in this dataset range from 3 to 9.

Salient observations:

  • Most white wines have a quality of 5 or 6
  • Median pH level is 3.180
  • Most white wines contain between 9 and 13 percent alcohol

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is volatile acidity. I wanted to find out how volatile acidity increases or decreases with the quality of the white wine. I suspect pH and some combination of the other variables can be used to build a predictive model to grade white wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I would like to see if the amount of residual sugar is related to the quality of the white wine, and also if there is any connection with the amount of alcohol in the wine itself.

Did you create any new variables from existing variables in the dataset?

A new variable was created named “bound.sulfur.dioxide”. It was shown in the summary of the data frame and was later used in the bi-variate plots section.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I found that the alcohol percentage distribution was more right skewed than those of the other variables that I investigated. Most of the white wines were below 13% alcohol. In most cases, I removed the outliers from the plots to get a better look at the bulk of the data.
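
The report does not say how the outliers were removed, so the following is only an illustration of one common approach: dropping values above a chosen upper quantile before plotting. The helper name `trim_upper` and the cut-off are hypothetical; the sample values are the first ten `residual.sugar` entries from the structure output.

```r
# Hypothetical helper (not from the original analysis): keep only the
# values at or below the chosen upper quantile.
trim_upper <- function(x, prob = 0.99) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# First ten residual sugar values, copied from the structure output.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# A low cut-off is used here only because the sample is tiny; on the
# full data set something like prob = 0.99 would be more typical.
trimmed <- trim_upper(residual.sugar, prob = 0.75)

print(trimmed)
```

Trimming like this only changes what is displayed. The summaries and correlations in this report are computed on the full data.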

Bivariate Analysis

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.023       0.289
## volatile.acidity            -0.023            1.000      -0.149
## citric.acid                  0.289           -0.149       1.000
## residual.sugar               0.089            0.064       0.094
## chlorides                    0.023            0.071       0.114
## free.sulfur.dioxide         -0.049           -0.097       0.094
## total.sulfur.dioxide         0.091            0.089       0.121
## density                      0.265            0.027       0.150
## pH                          -0.426           -0.032      -0.164
## sulphates                   -0.017           -0.036       0.062
## alcohol                     -0.121            0.068      -0.076
## quality                     -0.114           -0.195      -0.009
## bound.sulfur.dioxide         0.136            0.157       0.102
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.089     0.023              -0.049
## volatile.acidity              0.064     0.071              -0.097
## citric.acid                   0.094     0.114               0.094
## residual.sugar                1.000     0.089               0.299
## chlorides                     0.089     1.000               0.101
## free.sulfur.dioxide           0.299     0.101               1.000
## total.sulfur.dioxide          0.401     0.199               0.616
## density                       0.839     0.257               0.294
## pH                           -0.194    -0.090              -0.001
## sulphates                    -0.027     0.017               0.059
## alcohol                      -0.451    -0.360              -0.250
## quality                      -0.098    -0.210               0.008
## bound.sulfur.dioxide          0.345     0.194               0.264
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                       0.091   0.265 -0.426    -0.017  -0.121
## volatile.acidity                    0.089   0.027 -0.032    -0.036   0.068
## citric.acid                         0.121   0.150 -0.164     0.062  -0.076
## residual.sugar                      0.401   0.839 -0.194    -0.027  -0.451
## chlorides                           0.199   0.257 -0.090     0.017  -0.360
## free.sulfur.dioxide                 0.616   0.294 -0.001     0.059  -0.250
## total.sulfur.dioxide                1.000   0.530  0.002     0.135  -0.449
## density                             0.530   1.000 -0.094     0.074  -0.780
## pH                                  0.002  -0.094  1.000     0.156   0.121
## sulphates                           0.135   0.074  0.156     1.000  -0.017
## alcohol                            -0.449  -0.780  0.121    -0.017   1.000
## quality                            -0.175  -0.307  0.099     0.054   0.436
## bound.sulfur.dioxide                0.922   0.504  0.003     0.136  -0.427
##                      quality bound.sulfur.dioxide
## fixed.acidity         -0.114                0.136
## volatile.acidity      -0.195                0.157
## citric.acid           -0.009                0.102
## residual.sugar        -0.098                0.345
## chlorides             -0.210                0.194
## free.sulfur.dioxide    0.008                0.264
## total.sulfur.dioxide  -0.175                0.922
## density               -0.307                0.504
## pH                     0.099                0.003
## sulphates              0.054                0.136
## alcohol                0.436               -0.427
## quality                1.000               -0.218
## bound.sulfur.dioxide  -0.218                1.000

I noticed from the Pearson correlations above that the strongest correlations with volatile acidity are with quality (-0.195) and bound sulfur dioxide (0.157), followed by citric acid (-0.149). All three are weak. Let’s look at the visual representation of the correlations.
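
As a rough sketch of how such a ranking can be read off programmatically (the original code is not shown, and the ten sample rows below come from the structure output, so the numbers it prints will not match the full-data table above):

```r
# First ten rows of three columns, copied from the structure output.
# quality is left out because it is constant (6) in these ten rows,
# which would make its correlation undefined.
wine <- data.frame(
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid          = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  bound.sulfur.dioxide = c(125, 118, 67, 139, 139, 67, 106, 125, 118, 101)
)

# Pearson correlation matrix, rounded to three decimals like the table above.
corr <- round(cor(wine, method = "pearson"), 3)

# Take the volatile acidity column, drop its correlation with itself,
# and sort the rest by absolute value, strongest first.
va <- corr[, "volatile.acidity"]
va <- va[names(va) != "volatile.acidity"]
ranked <- va[order(abs(va), decreasing = TRUE)]

print(ranked)
```

Sorting by absolute value matters here because the feature of interest has both positive and negative correlations of similar size.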

The size and color of the circles show that volatile acidity has its strongest correlations with citric acid, quality, and bound sulfur dioxide, as stated above. Thus, the next step is to make a bi-variate plot for each of these three variables.

Volatile Acidity v/s Citric Acid

Volatile acidity tends to decrease slightly as citric acid increases (correlation of -0.149). Could the citric acid have an effect on the taste of the white wine?

Volatile Acidity v/s Quality

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## wine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

Quality level 4 has the highest volatile acidity of all levels (median 0.32, mean 0.3812), which is consistent with high volatile acidity harming the taste of the wine.
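
The `wine$quality: 3` headers in the output above look like what base R's `by()` prints, so the per-level summaries were probably produced along these lines. This is a guess at the call, and the six rows below are made-up toy values, not data from the study:

```r
# Toy data for illustration only (hypothetical values, not from the data set).
wine <- data.frame(
  volatile.acidity = c(0.27, 0.45, 0.30, 0.24, 0.26, 0.20),
  quality          = c(3, 4, 5, 6, 6, 7)
)

# Six-number summary of volatile acidity within each quality level.
print(by(wine$volatile.acidity, wine$quality, summary))

# The same grouping reduced to a single statistic per level, which is
# easier to compare across levels.
medians <- tapply(wine$volatile.acidity, wine$quality, median)
print(medians)
```

Comparing medians rather than means reduces the influence of the few very high volatile acidity values within a level.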

Volatile Acidity v/s Bound Sulfur Dioxide

Volatile acidity tends to increase slightly as bound sulfur dioxide increases (correlation of 0.157).

Let’s also look into alcohol against quality.

Interestingly, we observe a trend: as the alcohol percentage increases, so does the quality (correlation of 0.436).

Visual of bound vs free sulfur dioxide, showing a positive correlation (0.264).

Q & A

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Apart from quality, volatile acidity correlates most strongly with citric acid and bound sulfur dioxide, although both correlations are weak (-0.149 and 0.157).

The amount of volatile acidity decreases as citric acid increases, but the data were widely spread and showed only small clusters.

The overlay of jittered data on top of the box plot of volatile acidity against quality creates a good visual for comparing the different quality levels.

The visual for volatile acidity against bound sulfur dioxide didn’t really offer a good explanation, as the data were widely spread, but it did show an increase in volatile acidity when bound sulfur dioxide had increased a lot.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The new variable that I created shows a positive correlation between free sulfur dioxide and bound sulfur dioxide (0.264). Also, alcohol against quality showed that as the alcohol percentage increases, so does the quality.

What was the strongest relationship you found?

Among the relationships involving volatile acidity, the strongest was its negative correlation with quality (-0.195): wines with lower volatile acidity tended to have higher quality.

Multivariate Analysis

Citric Acid v/s Volatile Acidity factored by Quality

The volatile acidity plot elaborates on the odd trends that were seen in the box plots earlier. Most wines of quality 6 and above do not exceed 0.75 of volatile acidity.

Bound Sulfur Dioxide vs Volatile Acidity factored by Quality

Most of the quality levels are widely spread, but there does seem to be a large grouping between 45 and 170 mg/dm^3 of bound sulfur dioxide.

Q & A

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In the plot of citric acid against volatile acidity, the relationship between the two became clearer as the quality of the white wine increased, even though the correlation was negative.

Were there any interesting or surprising interactions between features?

Surprisingly, higher quality wines tend to have lower bound sulfur dioxide (its correlation with quality is -0.218), which can be seen in the different shades of green in the plot.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.

Final Plots and Summary

Plot One

The distribution of volatile acidity appears unimodal with a curious spike around 0.28.

Plot Two

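Median volatile acidity is highest at quality levels 4 and 5 (0.32 and 0.28) and lowest at levels 6 and 7 (0.25). This suggests that better wines tend to have less volatile acidity, although the trend is not strictly monotonic across all levels.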

Plot Three

The quality of wine increases as we move towards the lower right of the plot. Wine seems to have better quality when citric acid is around 0.15 and volatile acidity is 0.3.


Reflection

This data set contains information on 4,898 different white wines from a 2009 study. My goal was to find which chemical properties affected the volatile acidity in the white wine. I started out by exploring the distribution of individual variables and looked for unusual behaviors in the histograms. I then calculated and plotted the correlations between volatile acidity and the other variables. None of these correlations was above 0.5 in absolute value. The two chemical variables with the strongest correlations were citric acid and bound sulfur dioxide, but the individual correlations were not strong enough to draw definitive conclusions with bi-variate analysis methods alone. However, the multivariate plot shown as Final Plot 3 showed an increase in quality at certain citric acid values.

One suggestion for this data set is to include storage time and storage method, since these factors can influence the quality of wine as well. Further studies might examine the relationship between the price and quality of wine, to investigate whether more expensive wines are of better quality.